Title

Roy Longbottom at Linkedin  Four Core Eight Thread Computing Benchmarks

Contents


CPU Only Benchmark
Whetstone Benchmark
Maximum Data Flow Benchmark
Serial/Random Access Benchmark
OpenMP, MP-MFLOPS, QPAR Benchmarks
Multiple Tasks

General

Dual Core benchmark code (see DualCore.htm) has been modified to use eight threads, initially to measure the performance of four core processors with Hyperthreading, where Windows sees the system as having eight processors. Download QuadCore.zip for benchmark source code and EXE files at 64 bits and 32 bits. Results below include those for a Quad Core Phenom II and a Quad Core i7 with Hyperthreading. With one core in use, the latter processor can run at 3066 MHz using Turbo Boost, but this is reduced to 2933 MHz when more than one core is in use, or to the specified speed of 2800 MHz if the processor is hot. This behaviour makes the effects of Hyperthreading more difficult to determine.

Except for the Whetstone benchmark, which has program loops with few instructions, the test programs process long sequences of streamed data, with some using efficient assembly code. For these programs, high performance gains are not really expected on a quad core processor with Hyperthreading when using more than four threads.



CPUIDMP CPU Only Benchmark

Programs CPUID8Thread32.exe and CPUID8Thread64.exe are the same program compiled for 32 and 64 bits. They execute three passes of simple additions to four different registers, via assembly code, attempting to demonstrate maximum CPU speeds. First, an integer (INT) test and an SSE floating point test are run separately. They are then run together as two threads (1 INT and 1 SSE), followed by 2 INT and 2 SSE, 3 INT and 3 SSE, then 4 INT and 4 SSE. Further information can be found in WhatCPU Results.htm.

The high speeds achieved appear to leave a little room to squeeze in additional hyperthreaded instructions on the Core i7. Even using four threads, its integer throughput is disappointing and, between four and eight threads, the Phenom appears to be more efficient (on this particular code). The slow i7 speeds could be due to Turbo Boost reducing the clock from 3066 MHz to 2933 MHz, in which case the maximum gain from four cores might be 2933 / 3066 x 4 x 100 = 383%.
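
The Turbo Boost arithmetic above can be checked with a few lines of Python. This is only a sketch of the calculation in the text, not part of the benchmark; the function name max_gain_percent is made up for illustration.

```python
def max_gain_percent(multi_core_mhz, single_core_mhz, cores):
    """Upper bound on multi-thread gain, as a percentage of the
    single-thread speed, when the clock drops under multi-core load."""
    return multi_core_mhz / single_core_mhz * cores * 100


# Core i7 930 figures quoted in the text: 3066 MHz with one core busy,
# 2933 MHz with more than one core busy, four physical cores.
print(round(max_gain_percent(2933, 3066, 4)))  # 383
```

With no clock reduction the same function returns 400, the figure that applies to the Phenom II.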


   CPU                        Core 2 Athlon 64    Core 2 Phenom II   Core i7   Core i7
   MHz                          1830      2211      2400      3000      ####      ****
   CPUs/Hyperthreads             2/0       2/0       2/0       4/0       4/4       4/4
   Windows                     Vis32     XPx64     Vis64    Win764    Win764    Win864

   Separate Tests
   32 bit SSE   MFLOPS          6781      4400      9222     12020     10178     15460
   32 bit Integer MIPS          4556      6612      6296      9018      8611     12292


   Two Threads Equal Priority
   32 bit SSE   MFLOPS          6777      4384      9266     12003     10176     15460
   32 bit Integer MIPS          5117      6604      6740      9028      8606     12290


   Four Threads, First Normal Priority, Others Normal - 1
   32 bit SSE   MFLOPS          6816      4363      9086     11935     11032     14611
   32 bit SSE   MFLOPS             0      2215         0     11986      9161     14611
   32 bit Integer MIPS          2508      3232      3257      8897      6739     11916
   32 bit Integer MIPS          2642        67      3671      8956      6566     11259

   Total  SSE   MFLOPS          6816      6578      9086     23921     20193     29221
   Total  Integer MIPS          5150      3300      6929     17853     13305     23175

   Gain % SSE                    101       150        99       199       198       189
   Gain % Integer                113        50       110       198       155       189
                                                                   Total 353       378

   Six Threads, All Normal Priority
   32 bit SSE   MFLOPS          2200      1439      3059      5864      8166     15021
   32 bit SSE   MFLOPS          2355      1450      3114     12012      9124     10050
   32 bit SSE   MFLOPS          2358      1488      3111     11946      6653      9695
   32 bit Integer MIPS          1700      2192      2112      4452      4612      8103
   32 bit Integer MIPS          1669      2163      2249      4546      4612      8802
   32 bit Integer MIPS          1699      2257      2450      4519      4041      8658

   Total  SSE   MFLOPS          6913      4376      9284     29822     23942     34766
   Total  Integer MIPS          5068      6612      6811     13517     13265     25563

   Gain % SSE                    102        99       101       248       235       225
   Gain % Integer                111       100       108       150       154       208
                                                                   Total 389       433

   Eight Threads, All Normal Priority
   32 bit SSE   MFLOPS          1705      1083      2283      4077      5445     12063
   32 bit SSE   MFLOPS          1730      1067      2321      5867      5210     13852
   32 bit SSE   MFLOPS          1730      1078      2321     11982      5194     12678
   32 bit SSE   MFLOPS          1728      1130      2314      6141      4693     12593
   32 bit Integer MIPS          1252      1630      1680      4451      4032      6604
   32 bit Integer MIPS          1251      1634      1672      2973      4029      6926
   32 bit Integer MIPS          1411      1639      1671      4495      4036      6582
   32 bit Integer MIPS          1244      1732      1879      2968      4035      6736

   Total  SSE   MFLOPS          6893      4358      9240     28067     20541     51187
   Total  Integer MIPS          5158      6635      6902     14887     16132     26847

   Gain % SSE                    102        99       100       234       202       331
   Gain % Integer                113       100       110       165       187       218
                                                                   Total 389       549

 #### Core i7  930  rated at 2800 MHz but running up to 3066 MHz using Turbo Boost
 **** Core i7 4820K rated at 3700 MHz but running up to 3900 MHz using Turbo Boost




Whetstone Benchmark

The Whetstone Benchmark has various routines that execute floating point and integer instructions. Speeds of individual tests are given in Millions of Operations Per Second (MOPS), or MFLOPS for those using simple floating point arithmetic, with an overall rating in Millions of Whetstone Instructions Per Second (MWIPS). Programs Whets8Thread32.exe and Whets8Thread64.exe are the same program compiled for 32 and 64 bits. Unlike the dual core variety, this version uses common code and equal priority for all threads to produce more consistent performance. Results and further details can be found in Whetstone Results.htm. Those at 64 bits are somewhat faster due to improved optimisation.

The total (top line) results shown are calculated as a simple sum of the speeds of each thread and can be distorted by threads finishing at different times. Using a harmonic mean makes little difference. The overall MWIPS rating is calculated using the sum of elapsed times of tests in all threads. Considering the four core Phenom results, consistent speeds are produced on all tests using two and four threads, giving performance gains of 200% and nearly 400%. It is not clear why, but average gains using six and eight threads were around 450%.
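
As a rough illustration of the two ways of totalling thread speeds, the following Python sketch compares a simple sum with a total based on the harmonic mean, using Phenom II MFLOP 1 figures from the table below. The function names are invented here, and scaling the harmonic mean by the number of threads is an assumption about how such a total would be formed, not a description of the benchmark source.

```python
def total_by_sum(speeds):
    """Simple sum of per-thread speeds, as used for the top line results."""
    return sum(speeds)


def total_by_harmonic_mean(speeds):
    """Assumed alternative: number of threads times the harmonic mean speed."""
    n = len(speeds)
    return n * (n / sum(1.0 / s for s in speeds))


def gain_percent(total, single_thread_speed):
    """Multi-thread total as a percentage of the single-thread speed."""
    return total / single_thread_speed * 100


single = 902                                  # MFLOP 1, one thread
two_threads = [906, 905]                      # MFLOP 1, two threads
six_threads = [621, 613, 617, 933, 604, 934]  # MFLOP 1, six threads

for speeds in (two_threads, six_threads):
    print(total_by_sum(speeds),
          round(total_by_harmonic_mean(speeds)),
          round(gain_percent(total_by_sum(speeds), single)))
```

The two totals are almost identical when all threads run at the same speed and drift apart only when thread speeds are uneven, as in the six thread case.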

The Core i7 produces a 200% gain using two threads but less than 400% with four threads, no doubt due to the Turbo Boost clock of 3066 MHz being reduced towards the specification speed of 2800 MHz. This benchmark appears to show Hyperthreading in a most favourable light, producing average gains of around 550% using six threads and 700% with eight threads. The main beneficiaries are the floating point tests, in this case translated to SSE code as Single Instruction Single Data (SISD, not SIMD) operations.

                       MWIPS   MFLOP   MFLOP   MFLOP     COS     EXP   FIXPT      IF   EQUAL
  CPU             MHz              1       2       3    MOPS    MOPS    MOPS    MOPS    MOPS

  Phenom II Win7 3000   3115     902     739     716    69.5    49.3    2509    3008    1289
  Dual Core  Thread 1            902     739     716    69.5    49.3    2509    3008    1289

  Phenom II Win7 3000   6229    1811    1480    1432     139    98.6    5007    6022    2578
  Dual Core  Thread 1            906     738     716    69.5    49.3    2508    3010    1289
             Thread 2            905     741     716    69.5    49.3    2499    3012    1288
  Gain %                 200     201     200     200     200     200     200     200     200

  Phenom II Win7 3000  12414    3603    2950    2853     277     196    9988   11992    5139
  Dual Core  Thread 1            902     735     714    69.1    49.2    2481    2983    1278
             Thread 2            903     739     715    69.4    49.0    2501    3000    1287
             Thread 3            905     739     710    69.3    49.0    2499    2999    1285
             Thread 4            893     736     714    69.5    49.2    2508    3009    1288
  Gain %                 399     399     399     398     399     398     398     399     399

  Phenom II Win7 3000  14101    4322    3550    3374     325     231   12239   14250    6019
  Dual Core  Thread 1            621     767     725    46.3    49.5    2655    1995     860
             Thread 2            613     510     482    46.5    32.8    1722    3009     859
             Thread 3            617     496     477    46.4    33.0    1741    2116     862
             Thread 4            933     767     726    69.7    49.6    1725    3077    1291
             Thread 5            604     505     486    46.3    32.8    2651    2043     854
             Thread 6            934     506     477    69.8    33.1    1744    2011    1293
  Gain %                 453     479     480     471     468     469     488     474     467

  Phenom II 8 Threads Similar

 ##################################################################################

  Core i7 Win7   ####   3115    1065     886     738    79.3    39.7    2447    2936    1154
  Quad Core  Thread 1           1065     886     738    79.3    39.7    2447    2936    1154

  Core i7 Win7   ####   6228    2130    1773    1474     159    79.4    4894    5872    2308
  Quad Core  Thread 1           1065     887     737    79.3    39.7    2447    2936    1154
  Plus HT    Thread 2           1065     886     737    79.3    39.7    2448    2936    1154
  Gain %                 200     200     200     200     201     200     200     200     200

  Core i7 Win7   ####  12043    4243    3529    2930     302     156    9078   10207    4170
  Quad Core  Thread 1           1059     880     730    75.0    39.4    2102    2332    1018
  Plus HT    Thread 2           1064     881     733    76.9    38.7    2450    2498    1107
             Thread 3           1057     881     729    74.1    38.6    2187    2439    1044
             Thread 4           1063     887     738    76.4    39.0    2339    2938    1001
  Gain %                 387     398     398     397     381     393     371     348     361

  Core i7 Win7   ####  17149    6705    5463    4426     422     224   12984   13145    4869
  Quad Core  Thread 1           1146     919     739    72.3    37.6    2019    1958     816
  Plus HT    Thread 2           1145     915     736    69.8    37.0    2044    2664     793
             Thread 3           1143     916     744    71.8    37.0    2058    2083     793
             Thread 4           1111     926     737    68.5    37.6    2398    2023     788
             Thread 5           1097     916     742    72.2    37.8    2110    2124     827
             Thread 6           1062     872     728    67.8    36.7    2355    2292     852
  Gain %                 551     630     617     600     532     564     531     448     422

  Core i7 Win7   ####  21690    8676    7621    5844     531     291   16643   12027    5034
  Quad Core  Thread 1           1091    1027     728    66.4    36.5    2050    1501     629
  Plus HT    Thread 2           1089    1037     742    66.0    36.5    2090    1507     630
             Thread 3           1090     946     742    66.8    36.5    2069    1534     631
             Thread 4           1092    1037     727    66.6    36.6    2031    1501     630
             Thread 5           1042     959     736    66.4    36.5    1912    1483     630
             Thread 6           1091     874     723    66.6    36.1    2049    1507     629
             Thread 7           1090     867     725    65.6    36.3    2094    1516     631
             Thread 8           1091     874     722    66.3    36.3    2350    1476     624
  Gain %                 696     815     860     792     670     733     680     410     436

  #### i7 930 2800 MHz running using Turbo Boost at up to 3066 MHz

 ##################################################################################

  Core i7 Win8   $$$$   3807    1243    1042     931    86.0    52.4    3570    5741    1543
  Quad Core  Thread 1           1243    1042     931    86.0    52.4    3570    5741    1543

  Core i7 Win8   $$$$   7319    2461    2104    1782     165     100    6984   11103    2953
  Quad Core  Thread 1           1231    1052     890    82.5    50.2    3490    5551    1476
  Plus HT    Thread 2           1231    1052     891    82.5    50.2    3494    5552    1477
  Gain %                 192     198     202     191     192     191     196     193     191

  Core i7 Win8   $$$$  14616    4931    4229    3560     329     201   13868   22200    5899
  Quad Core  Thread 1           1233    1058     890    82.4    50.1    3443    5551    1476
  Plus HT    Thread 2           1233    1058     890    82.3    50.2    3494    5552    1475
             Thread 3           1232    1056     889    82.1    50.2    3438    5546    1472
             Thread 4           1232    1057     890    82.2    50.2    3494    5551    1476
  Gain %                 384     397     406     382     383     384     388     387     382

  Core i7 Win8   $$$$  20608    7421    6345    5280     459     287   19153   23418    6721
  Quad Core  Thread 1           1236    1057     881    78.0    47.7    3241    3545    1087
  Plus HT    Thread 2           1236    1058     882    78.1    47.3    3092    4881    1149
             Thread 3           1239    1058     880    75.3    48.8    3216    3378    1176
             Thread 4           1240    1058     880    75.8    47.5    3246    3378    1097
             Thread 5           1236    1057     880    76.0    48.5    3275    4314    1099
             Thread 6           1235    1057     878    75.7    47.5    3084    3922    1113
  Gain %                 541     597     609     567     534     548     536     408     436

  Core i7 Win8   $$$$  26301    9876    8162    7022     582     372   24785   22207    7493
  Quad Core  Thread 1           1235    1006     878    72.6    46.4    3099    2776     937
  Plus HT    Thread 2           1234    1050     876    72.5    46.5    3094    2777     938
             Thread 3           1235    1018     878    73.2    46.3    3097    2777     934
             Thread 4           1235     976     877    72.9    46.4    3095    2775     937
             Thread 5           1235    1034     883    72.7    46.3    3102    2775     937
             Thread 6           1235    1028     881    73.0    46.7    3098    2776     938
             Thread 7           1233    1017     871    72.7    46.4    3104    2775     934
             Thread 8           1233    1033     877    72.8    46.7    3096    2777     938
  Gain %                 691     795     783     754     677     710     694     387     486

  $$$$ i7 4820K 3700 MHz running using Turbo Boost at up to 3900 MHz



BusMP Maximum Data Flow Benchmark - MBytes/Second

Bus8Thread32.exe and Bus8Thread64.exe are the same program compiled for 32 and 64 bits. Results and further details can be found in BusSpd2K Results.htm. One difference is that integers for the 64 bit version are declared as 64 bits, rather than the default 32. The first results below show major performance differences between the two varieties, where speed in MBytes per second can be nearly twice as fast at 64 bits, indicating a processing speed limitation (64 bit integer arithmetic can run at the same rate as at 32 bits, moving twice as many bytes per instruction).

The program starts by reading words with 32 word address increments, to identify memory bus burst reading speed, then reduces the increment to eventually read all words sequentially. Finally, a test loads data to 128 bit SSE registers. Burst reading is mainly at 64 bytes at a time, so maximum speed is likely to be 16 times the MB/second measured at 16 word increments for 32 bit numbers, or at 8 word increments for 64 bit numbers. On the single thread results, burst calculations suggest that the Phenom could achieve 7280 MB/second RAM speed from one CPU, similar to that obtained at 128 bits. The figure for the i7 is 11344 MB/second, higher than that achieved. According to the specifications, maximum speeds are 21333 MB/second (at 667 MHz) for the Phenom and 17067 MB/second (at 533 MHz) for the i7. Multi-thread tests achieve up to 15000 MB/second and nearly 14000 MB/second respectively.
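
The burst and specification figures quoted above can be reproduced with the short Python sketch below. It is an illustration only: the function names are invented, and the peak calculation assumes dual channel DDR memory with 8 bytes per channel per transfer and two transfers per bus clock, which is what the quoted 21333 and 17067 MB/second imply.

```python
def burst_limited_mb_per_sec(mb_per_sec_at_16_word_inc, words_per_burst=16):
    """Estimate sequential RAM speed from the 32 bit Inc 16wds result,
    where each 64 byte burst delivers sixteen 4 byte words."""
    return mb_per_sec_at_16_word_inc * words_per_burst


def ddr_peak_mb_per_sec(bus_mhz, channels=2, bytes_per_transfer=8):
    """Assumed peak: two transfers per bus clock on each memory channel."""
    return bus_mhz * 2 * bytes_per_transfer * channels


# 32 bit RAM results at Inc 16wds from the single thread table
print(burst_limited_mb_per_sec(455))        # Phenom II: 7280
print(burst_limited_mb_per_sec(709))        # Core i7 930: 11344

# Nominal 667 MHz and 533 MHz buses are 2000/3 and 1600/3 MHz
print(round(ddr_peak_mb_per_sec(2000 / 3)))  # 21333
print(round(ddr_peak_mb_per_sec(1600 / 3)))  # 17067
```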

Part two tables show performance and gains using 1, 2, 4, 6 and 8 threads, for all caches and RAM, using the 32 bit compilation, at Inc 32wds, Read All and 128b SSE2. Using 4 or more threads, the Phenom achieves performance gains of 360% to 390% via L1 and L2 caches, around 320% via L3 and 200% to 250% using RAM. With the Core i7, there are only significant gains due to Hyperthreading in the 128 bit SSE L1 cache test. Here, the maximum speed is likely to be one 16 byte result per CPU clock per processor core, or 16 x 2800 x 4 = 179,200 MB/second. This was nearly achieved using 8 threads. On the downside, it looks as though the system was trying to use eight lots of 1.5 MB (L3 data) at the same time, 12 MB in total and more than the 8 MB L3 cache can hold, forcing data to be read from RAM.
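
A minimal Python sketch of the two Core i7 930 calculations in this paragraph follows. The names are illustrative only, and the 16 bytes per clock per core figure is the estimate given in the text rather than a measured or documented limit.

```python
def sse_l1_peak_mb_per_sec(mhz, cores, bytes_per_clock=16):
    """Estimated peak when each core loads one 128 bit register per clock."""
    return bytes_per_clock * mhz * cores


def combined_footprint_kb(threads, kb_per_thread):
    """Total data touched if every thread uses its own block at once."""
    return threads * kb_per_thread


peak = sse_l1_peak_mb_per_sec(2800, 4)
measured = 170513  # 8 thread 128b SSE2 L1 result from the table below
print(peak, round(measured / peak * 100, 1))

l3_kb = 8192       # Core i7 930 L3 cache size from the table below
print(combined_footprint_kb(8, 1536), combined_footprint_kb(8, 1536) > l3_kb)
```

The second line of output shows why the 8 thread L3 test falls back to RAM: the combined data no longer fits in the shared cache.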

 
  Single Thread    Cache   MHz     Inc     Inc     Inc     Inc     Inc    Read    128b
  Results           RAM          32wds   16wds    8wds    4wds    2wds     All    SSE2
 
  Phenom II    32b   L1   3000   10606   13543   13819   13363   13463   14219   23691
  Phenom II    32b   L2   3000    1496    1495    2957    5972   11352   13145   23798
  Phenom II    32b   L3   3000     659     751    1377    2995    5656    9562   10838
  Phenom II    32b  RAM   3000     439     455     894    1846    3097    5214    7302

  Phenom II    64b   L1   3000   20650   21652   25936   25907   26860   27037   23718
  Phenom II    64b   L2   3000    2922    2970    2992    5927   11859   22500   23881
  Phenom II    64b   L3   3000    1419    1462    1492    2908    5958   11097   11891
  Phenom II    64b  RAM   3000     832     877     911    1784    3676    6237    7360

  Core i7 930  32b   L1   ****   10303    9510    9654    9122    9134    9023   23326
  Core i7 930  32b   L2   ****    1996    2041    3677    5980    8009    8643   22092
  Core i7 930  32b   L3   ****    1948    2004    3608    5848    8074    8614   21650
  Core i7 930  32b  RAM   ****     526     709    1350    2352    4458    7063    9485

  Core i7 930  64b   L1   ****   20105   18713   19136   17974   18126   17910   23345
  Core i7 930  64b   L2   ****    3934    3999    4076    7064   12003   15793   21923
  Core i7 930  64b   L3   ****    3842    3909    4028    6979   11748   15845   21848
  Core i7 930  64b  RAM   ****     949    1048    1419    2736    4698    8812    9459

  Core i7 4820 32b   L1   $$$$   15642   15642   22493   21590   21709   21375   61610
  Core i7 4820 32b   L2   $$$$    2782    2904    5623    9806   17348   20363   40673
  Core i7 4820 32b   L3   $$$$    2741    2821    5499    9736   16795   20679   38331
  Core i7 4820 32b  RAM   $$$$     644     934    1994    3842    8098   13852   15963

  Core i7 4820 64b   L1   $$$$   31565   31291   31178   42042   42508   41978   61606
  Core i7 4820 64b   L2   $$$$    5364    5427    5508   10779   19355   33166   37951
  Core i7 4820 64b   L3   $$$$    5364    5427    5508   10779   19355   33166   37951
  Core i7 4820 64b  RAM   $$$$    1034    1272    1866    4023    7724   16029   15980

 
  L1 Cache Results in MBytes/Second - 6 KB                    % Gain

                 Cache CPUs/  MHz     Inc    Read    128b     Inc   Read   128b
                   KB   HTs         32wds     All    SSE2   32wds    All   SSE2
 
  Phenom II        64   4/0  3000   10606   14219   23691
  764  2 Threads  128               21150   28435   47423     199    200    200
  4 Threads       256               40763   54630   92595     384    384    391
  6 Threads       256               31624   54370   88023     298    382    372
  8 Threads       256               38638   53126   85948     364    374    363

  Core i7 930      32   4/4  ****   10303    9023   23326
  764  2 Threads   64               20590   18031   46677     200    200    200
  4 Threads       128               29499   31104   91726     286    345    393
  6 Threads       128               35391   35846  137181     344    397    588
  8 Threads       128               41300   39292  170513     401    435    731

  Core i7 4820K    32   4/4  $$$$   15642   21375   61610
  864  2 Threads   64               31284   42597  123206     200    199    200
  4 Threads       128               39511   70155  238644     253    328    387
  6 Threads       128               54064   88245  309920     346    413    503
  8 Threads       128               62539  107411  402166     400    503    653


  L2 Cache Results in MB/Second - 96 KB                       % Gain

  Phenom II       512   4/0  3000    1496   13145   23798
  2 Threads      1024                2983   26351   47336     199    200    199
  4 Threads      2048                5761   51226   92184     385    390    387
  6 Threads      2048                5863   48050   86055     392    366    362
  8 Threads      2048                5380   48529   85650     360    369    360

  Core i7 930     256   4/4  ****    1996    8643   22092
  2 Threads       512                3378   17305   43722     169    200    198
  4 Threads      1024                3866   26611   60836     194    308    275
  6 Threads      1024                4049   33262   64866     203    385    294
  8 Threads      1024                4178   37228   68711     209    431    311

  Core i7 4820K   256   4/4  $$$$    2782   20363   40673
  2 Threads       512                5552   40717   80597     200    200    198
  4 Threads      1024                8984   74935  123866     323    368    305
  6 Threads      1024                9844   83460  143356     354    410    352
  8 Threads      1024               10703   98906  164050     385    486    403


  L3 Cache - 1536 KB Data                                     % Gain

  Phenom II      6144   4/0  3000     659    9562   10838
  2 Threads                          1431   18082   22559     217    189    208
  4 Threads                          2222   29623   34566     337    310    319
  6 Threads                          2221   30682   34525     337    321    319
  8 Threads                          2240   31417   35148     340    329    324

  Core i7 930    8192   4/4  ****    1948    8614   21650
  2 Threads                          3192   17141   42945     164    199    198
  4 Threads                          3772   30387   58809     194    353    272
  6 Threads                          2537   29429   43411     130    342    201
  8 Threads                          1060   19526   15886      54    227     73

  Core i7 4820K 10240   4/4  $$$$    2741   20679   38331
  2 Threads                          5343   41353   76302     195    200    199
  4 Threads                          8369   74219  129958     305    359    339
  6 Threads                          7924   73640  123287     289    356    322
  8 Threads                          5467   60140   92112     199    291    240


  RAM Results in MBytes/Second - 128 MB                       % Gain

  Phenom II             4/0  3000     439    5214    7302
  2 Threads                           744    8920   12162     169    171    167
  4 Threads                           913   13000   14952     208    249    205
  6 Threads                           902   13183   15005     205    253    205
  8 Threads                           909   12701   14966     207    244    205

  Core i7 930           4/4  ****     526    7063    9485
  2 Threads                           637   11883   12945     121    168    136
  4 Threads                           724   13600   13828     138    193    146
  6 Threads                           731   13572   13911     139    192    147
  8 Threads                           731   13750   13722     139    195    145

  Core i7 4820K         4/4  $$$$     644   13852   15963
  2 Threads                          1135   26066   28578     176    188    179
  4 Threads                          1316   36384   35472     204    263    222
  6 Threads                          1291   36347   36784     200    262    230
  8 Threads                          1374   36504   36414     213    264    228

        **** i7 930   2800 MHz running using Turbo Boost at up to 3066 MHz      
        $$$$ i7 4820K 3700 MHz  Turbo Boost at up to 3900 MHz, RAM max 51.2 GB/s





RandMP Serial/Random Access Benchmark

Rand8Thread32.exe and Rand8Thread64.exe are compiled from the same program, but for 32 and 64 bits. The program uses the same code for serial and random use via a complex indexing structure and comprises Read (RD) and Read/Write (RW) tests. They are run to use data from L1 cache, L2 cache and RAM using 1, 2, 4, 6 and 8 threads. Results (32 and 64 bit versions) and further details can be found in RandMem Results.htm.

This benchmark uses data from the same array for all threads, but starting at different points. As with the dual core version, read/write tests, and particularly the random ones, suffer reduced performance with more than one thread, because dedicated caches must be flushed to maintain data coherency. Here, speed using shared L2 or L3 cache can be faster than using L1 cache. Results for the 32 bit version below show the total throughput of all threads, based on the harmonic mean. Data sizes are, again, 6 KB for L1 cache, 96 KB for L2 cache and 1536 KB for L3 cache, but 96 MB for RAM.

On the Phenom, speed on serial reading, from caches and RAM, is similar to that for BusMP Read All tests. This also applies via caches for the Core i7 but, using RAM, the data transfer speed appears to be higher than possible, most likely due to efficient caching of shared data (different data starting point probably suits 8 MB L3 cache). This i7 RAM test is the only one where Hyperthreading has a major impact.

Random reading speed via L1 cache is similar to that for serial reading but becomes progressively slower through other caches and RAM. The Core i7 is the faster from L3 cache and RAM using few threads, but the Phenom nearly catches up at 8 threads. The i7 is clearly the much faster of the two systems on most read/write tests, but still struggles to achieve a throughput gain of greater than 2.0 using more than two threads. Note that, using one thread on random read/write of L1 cache sized data, the i7 is five times faster than when using multiple threads, and the Phenom up to ten times faster. For the latter, using data in RAM is faster than using data that could sit within L1 cache.
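
The gain columns in the tables that follow are simply throughput at a given thread count divided by single thread throughput. The Python sketch below, with an invented helper name, recomputes them for the random read/write cases discussed above, using figures copied from the tables.

```python
def gains(throughputs):
    """Speed-up factors relative to the first (single thread) entry."""
    base = throughputs[0]
    return [round(t / base, 1) for t in throughputs[1:]]


# MBytes/second at 1, 2, 4, 6 and 8 threads
i7_930_random_rw_l1 = [14375, 3027, 2780, 2901, 3297]
phenom_random_rw_l1 = [7585, 1381, 752, 833, 757]
phenom_random_rw_ram = [313, 634, 1113, 1157, 1153]

print(gains(i7_930_random_rw_l1))   # [0.2, 0.2, 0.2, 0.2]
print(gains(phenom_random_rw_l1))   # [0.2, 0.1, 0.1, 0.1]

# Phenom II at 4 threads: RAM sized data beats L1 sized data
print(phenom_random_rw_ram[2] > phenom_random_rw_l1[2])  # True
```

A gain of 0.2 is the five times slowdown mentioned for the i7, and 0.1 the ten times slowdown for the Phenom.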


             CPUs          MBytes Per Second Using Threads        Gain At Threads
             /HTs         1       2       4       6       8     2     4     6     8
 Serial RD
 Core i7     4/8 L1   11458   22661   37039   43717   46374   2.0   3.2   3.8   4.0
 930             L2   10380   20832   32853   41711   42839   2.0   3.2   4.0   4.1
 #### MHz        L3    8828   17743   29610   38414   40330   2.0   3.4   4.4   4.6
 Win 764        RAM    4266    8712   17347   24946   25589   2.0   4.1   5.8   6.0

 Serial RW
 Core i7     4/8 L1   15282   13724   16240   16209   18379   0.9   1.1   1.1   1.2
 930             L2   12223   18216   25326   28104   27047   1.5   2.1   2.3   2.2
 #### MHz        L3   10234   19266   21931   24450   26351   1.9   2.1   2.4   2.6
 Win 764        RAM    4533    7656   13876   14543   13390   1.7   3.1   3.2   3.0

 Random RD
 Core i7     4/8 L1   11266   22548   38174   45592   47141   2.0   3.4   4.0   4.2
 930             L2    6233   12463   20059   24986   25667   2.0   3.2   4.0   4.1
 #### MHz        L3    3499    6915    9211   10002    9531   2.0   2.6   2.9   2.7
 Win 764        RAM     459     909    1241    1398    1364   2.0   2.7   3.0   3.0

 Random RW
 Core i7     4/8 L1   14375    3027    2780    2901    3297   0.2   0.2   0.2   0.2
 930             L2    5887    4555    6117    6693    7281   0.8   1.0   1.1   1.2
 #### MHz        L3    3104    4604    4721    5047    4933   1.5   1.5   1.6   1.6
 Win 764        RAM     428     860     899     948    1026   2.0   2.1   2.2   2.4

 #### 2.8 GHz running at up to 3.06 GHz via Turbo Boost, dual channel 1066 MHz DDR3 RAM 

 ##################################################################################
 
             CPUs                 Number Of Threads               Gain At Threads
             /HTs         1       2       4       6       8     2     4     6     8
 Serial RD
 Core i7     4/8 L1   28442   57130  114198  114435  107457   2.0   4.0   4.0   3.8
 4820K           L2   20531   41075   82142   87468   92156   2.0   4.0   4.3   4.5
 $$$$ MHz        L3   17015   34734   69551   77040   81525   2.0   4.1   4.5   4.8
 Win 8.1        RAM    6004   12438   25044   38420   42316   2.1   4.2   6.4   7.0

 Serial RW
 Core i7     4/8 L1   30091   21439   20928   24068   28856   0.7   0.7   0.8   1.0
 4820K           L2   22100   20942   38196   48821   53497   0.9   1.7   2.2   2.4
 $$$$ MHz        L3   17341   33271   65558   60361   73659   1.9   3.8   3.5   4.2
 Win 8.1        RAM   10680   21454   42836   50906   53162   2.0   4.0   4.8   5.0

 Random RD
 Core i7     4/8 L1   27862   55813  111471  111534  104011   2.0   4.0   4.0   3.7
 4820K           L2   13514   27231   54374   54880   59899   2.0   4.0   4.1   4.4
 $$$$ MHz        L3    5557   11141   20900   21977   14510   2.0   3.8   4.0   2.6
 Win 8.1        RAM     627    1238    2472    2533    2479   2.0   3.9   4.0   4.0

 Random RW
 Core i7     4/8 L1   29930    3734    3215    4134    5002   0.1   0.1   0.1   0.2
 4820K           L2    9374    5108    8194    8510    9159   0.5   0.9   0.9   1.0
 $$$$ MHz        L3    4759    7101   12497   13962   13291   1.5   2.6   2.9   2.8
 Win 8.1        RAM     588    1256    2496    2526    2521   2.1   4.2   4.3   4.3

 $$$$ 3.7 GHz running at up to 3.9 GHz via Turbo Boost, quad channel 1600 MHz DDR3 RAM
                        RAM max throughput 51.2 GB/second

 ##################################################################################
 
             CPUs          MBytes Per Second Using Threads        Gain At Threads
             /HTs         1       2       4       6       8     2     4     6     8
 Serial RD
 Phenom II   4/0 L1   15212   29350   58904   58896   54909   1.9   3.9   3.9   3.6
 3000 MHz        L2   12236   24767   49039   50798   47318   2.0   4.0   4.2   3.9
 Win 764         L3    8148   16402   30391   33436   32457   2.0   3.7   4.1   4.0
 1333 MHz DDR3  RAM    3917    6983   11299   12484   12002   1.8   2.9   3.2   3.1
 
 Serial RW
 Phenom II   4/0 L1    7741    5100    5750    6598    6517   0.7   0.7   0.9   0.8
 3000 MHz        L2    7998    5906    7479    8466    8345   0.7   0.9   1.1   1.0
 Win 764         L3    7132   13142    7489    8566    8582   1.8   1.1   1.2   1.2
 1333 MHz DDR3  RAM    3589    5981    8576    7913    7802   1.7   2.4   2.2   2.2
 
 Random RD
 Phenom II   4/0 L1   14367   27877   56817   55300   54129   1.9   4.0   3.8   3.8
 3000 MHz        L2    7250   14355   28436   29723   27962   2.0   3.9   4.1   3.9
 Win 764         L3    1560    3419    6641    7403    7410   2.2   4.3   4.7   4.8
 1333 MHz DDR3  RAM     339     679    1140    1336    1242   2.0   3.4   3.9   3.7
 
 Random RW
 Phenom II   4/0 L1    7585    1381     752     833     757   0.2   0.1   0.1   0.1
 3000 MHz        L2    5985    1624    1230    1387    1245   0.3   0.2   0.2   0.2
 Win 764         L3    1505    1724    1377    1545    1572   1.1   0.9   1.0   1.0
 1333 MHz DDR3  RAM     313     634    1113    1157    1153   2.0   3.6   3.7   3.7




OpenMP, MP-MFLOPS, QPAR Benchmark MFLOPS

OpenMP is a system independent set of procedures and software that arranges automatic parallel processing of shared memory data when more than one processor is provided. This option is available in the latest Microsoft C++ compilers. The benchmark executes the same functions, using the same data sizes, as the CUDA Graphics GPU Parallel Computing Benchmark, with varieties compiled for 32 bit and 64 bit operation, using old style i387 floating point instructions and more recent SSE code (OpenMP32MLOPS.exe and OpenMP64MLOPS.exe). A run time Affinity option is available to execute the benchmark on a selected single processor. These benchmarks and a non-OpenMP SSE version (SSE32MFLOPS.exe) can be downloaded via OpenMPMflops.zip. Results and further details can be found in OpenMP MFLOPS.htm.

The benchmark demonstrates that OpenMP can make use of four CPUs but gains little extra from Hyperthreading on the Core i7. Each test reads 1000 MB and writes 1000 MB. At least the largest data size of 10M words will be read from and written to RAM, so with 2 floating point operations per word speed could be limited by memory speed. Two example calculations of MB/second are shown below.
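
One way to turn an MFLOPS result into memory traffic is sketched below in Python. This is an illustration rather than the author's own calculation: it assumes 4 byte single precision words (consistent with 100000 words x 2500 passes x 4 bytes = 1000 MB) and that each word is read once and written once per pass. The function names are hypothetical.

```python
BYTES_PER_WORD = 4  # assumption: single precision floating point data


def data_mb_per_test(words, passes):
    """MBytes read (and also written) over all passes of one test."""
    return words * passes * BYTES_PER_WORD / 1e6


def read_mb_per_sec(mflops, ops_per_word):
    """MBytes/second read; the same amount is written back."""
    return mflops / ops_per_word * BYTES_PER_WORD


print(data_mb_per_test(100000, 2500))   # 1000.0
print(data_mb_per_test(10000000, 25))   # 1000.0

# 10M words, i387 4/8 CPU results from the first table below
print(read_mb_per_sec(9364, 2))    # 2 operations per word
print(read_mb_per_sec(22350, 32))  # 32 operations per word
```

At 2 operations per word the implied traffic is several times higher than at 32 operations per word, which is why the former is the case most likely to be limited by memory speed.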

October 2014 additions are for a second Core i7, with a higher speed CPU than the earlier one and much faster memory via four channels, but slower graphics when executing CUDA calculations with no data communication with the CPUs. The main additions are for MP MFLOPS and QPAR comparative benchmarks, where full details and source code links are in GigaFLOPS Benchmarks.htm. OpenMP uses simple compiler directives to produce automatic multiprocessing. MP MFLOPS uses identical calculations but with additional C code to organise threads. QPAR is Microsoft’s proprietary equivalent of OpenMP, just requiring a different compiler parameter.

As with other floating point benchmarks produced via the same compiler version, the OpenMP SSE benchmark did not implement SIMD functions for four simultaneous floating point calculations, but used SSE SISD instructions operating on one variable. OpenMP overheads are also high and the non-OpenMP SSE programs can be faster. Using the same compiler with MP MFLOPS produced programs with lower overheads than OpenMP but similar maximum speeds using 8 threads.

OpenMP and MP MFLOPS were recompiled using the compiler that came with Microsoft Visual Studio 2013. Performance of the former did not improve, but full SIMD instructions were produced for MP MFLOPS (New SSE). Note that maximum GFLOPS is GHz x 4 (SIMD) x 2 (multiply and add) x cores = 3.9 x 4 x 2 x 4 = 124.8, and up to 98.4 GFLOPS was obtained.
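
The peak GFLOPS formula above, and the fraction of it reached by the best New SSE result, can be sketched in Python as follows. The function name is made up for illustration and the formula is the one given in the text.

```python
def peak_gflops(ghz, simd_width, ops_per_lane, cores):
    """Peak = clock x SIMD lanes x (multiply + add) x cores."""
    return ghz * simd_width * ops_per_lane * cores


peak = peak_gflops(3.9, 4, 2, 4)   # Core i7 4820K at Turbo Boost speed
best_measured = 98446 / 1000       # best New SSE 8 thread MFLOPS, as GFLOPS

print(round(peak, 1))                        # 124.8
print(round(best_measured / peak * 100, 1))  # percentage of peak achieved
```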

This processor provides AVX instructions, which were produced by a compile option but, unlike a Linux GCC compilation, the resulting program did not demonstrate speeds potentially twice as fast as the SSE version. However, the program compiled with the QPAR directive provided performance as good as might be expected (up to 91.2 GFLOPS).


  Core i7 4820K 3.7 GHz running up to 3.9 GHz via Turbo Boost
                                                                CUDA    CUDA
   Data    Ops/ Repeat    SSE    i387    i387 SSE 64b SSE 64b GeForce  No I/O
   Words   Word Passes  1 CPU   1 CPU 4/8 CPU   1 CPU 4/8 CPU  GTX650  GTX650

    100000    2   2500   5172    2155    6705    2963    6690     459    3449
   1000000    2    250   5126    2534    9710    3560    9694     893    8806
  10000000    2     25   4306    2507    9364    3402    9397     980   10530

    100000    8   2500   6170    4146   14613    5605   14591    2375   13056
   1000000    8    250   6163    4460   17569    6077   17622    3545   34014
  10000000    8     25   6130    4453   17885    6088   17887    3896   40905

    100000   32   2500   5848    5124   21243    5745   21188    9183   43975
   1000000   32    250   5841    5203   22372    5863   22839   14006  120972
  10000000   32     25   5847    5218   22350    5860   22391   15499  147100

  Maximum Gain                           430%            390%
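The gain figures can be reproduced from the table columns. The Python sketch below assumes that the quoted gain is the largest per-line ratio of the 4/8 CPU speed to the 1 CPU speed, rounded to a whole percentage; that definition is an assumption, although it does reproduce the two values quoted for this processor.

```python
# MFLOPS columns copied from the Core i7 4820K table above.
i387_1cpu = [2155, 2534, 2507, 4146, 4460, 4453, 5124, 5203, 5218]
i387_mp = [6705, 9710, 9364, 14613, 17569, 17885, 21243, 22372, 22350]
sse64_1cpu = [2963, 3560, 3402, 5605, 6077, 6088, 5745, 5863, 5860]
sse64_mp = [6690, 9694, 9397, 14591, 17622, 17887, 21188, 22839, 22391]

def max_gain_percent(single, multi):
    """Assumed definition: best per-line ratio of 4/8 CPU to 1 CPU speed."""
    return round(max(100 * m / s for s, m in zip(single, multi)))

print(max_gain_percent(i387_1cpu, i387_mp),
      max_gain_percent(sse64_1cpu, sse64_mp))
```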


  Core i7 4820K 3.7 GHz MP MFLOPS and QPAR

   Data    Ops/ Repeat   i387  SSE 64 New SSE New SSE     AVX    QPAR    QPAR
   Words   Word Passes 8 Thrd  8 Thrd  1 Thrd  8 Thrd  8 Thrd  1 Thrd  8 Thrd

    100000    2   2500  15359   19602   10116   58734   58601   10181   42665
   1000000    2    250  15395   19776    9864   43723   43529    9972   38325
  10000000    2     25   9846    9820    5852    9980   10032    5842    9761

    100000    8   2500  23554   24683   24636   97139   85198   24458   75672
   1000000    8    250  23708   24648   24436   98446   98220   24086   88846
  10000000    8     25  23586   24634   19881   40062   40162   19646   37919

    100000   32   2500  23418   23521   23353   91320   93810   23497   88217
   1000000   32    250  23464   23497   23389   93885   93866   23533   91233
  10000000   32     25  23416   23506   23243   93125   93745   23373   86306
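As a quick check on the AVX and QPAR comments, the 8 thread columns of the table above can be compared line by line. This Python sketch simply takes ratios of the tabulated MFLOPS values; the variable names are illustrative.

```python
# 8 thread MFLOPS columns copied from the MP MFLOPS and QPAR table above.
new_sse_8 = [58734, 43723, 9980, 97139, 98446, 40062, 91320, 93885, 93125]
avx_8 = [58601, 43529, 10032, 85198, 98220, 40162, 93810, 93866, 93745]
qpar_8 = [42665, 38325, 9761, 75672, 88846, 37919, 88217, 91233, 86306]

# AVX speed relative to New SSE on each line (2.0 would be the ideal gain).
avx_vs_sse = [a / s for a, s in zip(avx_8, new_sse_8)]
best_avx_ratio = max(avx_vs_sse)

# Best QPAR result expressed in GFLOPS.
best_qpar_gflops = max(qpar_8) / 1000

print(round(best_avx_ratio, 2), round(best_qpar_gflops, 1))
```

The ratios stay close to 1.0 on every line, so this AVX compilation gains essentially nothing over the New SSE version.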


 ############################################################################

  Core i7 930 2.8 GHz running at up to 3.06 GHz via Turbo Boost
  Windows 7 64
                                                                CUDA    CUDA
   Data    Ops/ Repeat    SSE    i387    i387 SSE 64b SSE 64b GeForce  No I/O
   Words   Word Passes  1 CPU   1 CPU 4/8 CPU   1 CPU 4/8 CPU  GTX480  GTX480

    100000    2   2500   3567    1248    4455    1574    4001     521    5554
   1000000    2    250   3529    1420    5433    1861    4919     819   21493
  10000000    2     25   2388    1364    3038    1735    3076xx  1014   31991

    100000    8   2500   4655    2337    8798    3794   14581    2058   20129
   1000000    8    250   4642    2413    9813    4149   17080    3306   82132
  10000000    8     25   4453    2436    9581    4011   12457    4057  125413

    100000   32   2500   3328    2957   12020    4324   16786    7768   52230
   1000000   32    250   3329    3011   12339    4436   17599   13190  254306
  10000000   32     25   3307    3003   12432    4418   17576yy 16077  425237

  Maximum Gain                           414%            412%

  xx in 0.163 seconds - MB/Second = 2000 / 0.163 = 12270 (x  2/8 for MFLOPS)
  yy in 0.455 seconds - MB/Second = 2000 / 0.455 =  4396 (x 32/8 for MFLOPS) 
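The two footnote calculations can be sketched in Python as follows. The sketch assumes 8 bytes of traffic per word (4 read plus 4 written), which is consistent with the division by 8 in the footnotes; the function name and default arguments are illustrative.

```python
def speeds_from_time(seconds, ops_per_word, mb_moved=2000, bytes_per_word=8):
    """Return (MB/second, MFLOPS) for one test line.

    Assumption: 1000 MB read plus 1000 MB written, and each word
    accounts for 8 bytes of traffic (4 read + 4 written).
    """
    mb_per_second = mb_moved / seconds
    mflops = mb_per_second * ops_per_word / bytes_per_word
    return mb_per_second, mflops

xx_mbs, xx_mflops = speeds_from_time(0.163, 2)    # 10M words, 2 ops/word
yy_mbs, yy_mflops = speeds_from_time(0.455, 32)   # 10M words, 32 ops/word

print(round(xx_mbs), round(xx_mflops))
print(round(yy_mbs), round(yy_mflops))
```

The MFLOPS values derived this way differ slightly from the 3076 and 17576 in the table, because the elapsed times quoted in the footnotes are rounded to three decimal places.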


 ############################################################################

  Phenom II X4 3.0 GHz, Windows 7 64
                                                                CUDA    CUDA
   Data    Ops/ Repeat    SSE    i387    i387 SSE 64b SSE 64b GeForce  No I/O
   Words   Word Passes  1 CPU   1 CPU   4 CPU   1 CPU   4 CPU  GTS250  GTS250

    100000    2   2500   3552    1920    5587    1822    5613     328    3054
   1000000    2    250   3268    1919    5585    1870    7056     625    9672
  10000000    2     25   1861    1625    2993    1563    2972     714   13038

    100000    8   2500   4535    2115    7763    3637   12653    1336   12233
   1000000    8    250   4341    2108    7975    3709   14518    2382   39481
  10000000    8     25   4141    2100    8062    3543   11273    2949   51199

    100000   32   2500   4012    2566    9675    3652   14092    5142   36080
   1000000   32    250   3981    2552   10091    3663   14510    9427  108170
  10000000   32     25   3941    2510    9902    3633   14034   11182  135041

  Maximum Gain                           395%            396%




Multiple Tasks

Multitasking tests were run on the Core i7 using IntBurn64.exe and SSEBurn64.exe, which are described in BurnIn64.htm and BurnIn4CPU.htm. The benchmarks and source code are in More64bit.zip. The tests run were one copy each of the integer and SSE floating point programs, four concurrent copies of the integer test, and four copies of both the integer and SSE programs at the same time. Test durations were one minute each, and results showed that all multitasking tests started and finished within the same second of clock time. Each test used 8 KB of data, to fit in L1 cache. The SSE tests used the Cache Test option, normally the fastest.

Single test results show that the integer test produces around one 64 bit result per clock cycle, and the SSE test around four 32 bit (128 bits) floating point results per cycle. As might be expected, the higher Turbo Boost clock frequency available with one CPU in use means that four concurrent integer tests do not reach a 400% performance level. However, running all eight programs, with Hyperthreading, increases the combined integer plus SSE gain to between 428% (215% + 213%) and 450% (237% + 213%).


                        1 Test  ----- 4 Concurrent Tests ----   Total   Gain
   
 Int Write/Read MB/sec  14195   13955   13902   13879   13905   55641   392%
 Int Read       MB/sec  20267   20206   20191   20179   20169   80746   398%

 Int Write/Read MB/sec           8127    8756    8345    8414   33641   237%
 Int Read       MB/sec          10914   10794   10790   11049   43547   215%

 SSE Calculate  MFLOPS  11743    6231    6119    6144    6517   25011   213%
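The Gain column, and the 428% to 450% range quoted for the eight program run, can be reproduced from the table with a short Python sketch. Each gain is taken as the total for the concurrent copies divided by the single test speed; the names used are illustrative.

```python
def gain_percent(total, single):
    """Total speed of the concurrent copies as a percentage of one test."""
    return round(100 * total / single)

# Single test speeds from the "1 Test" column.
int_wr_1, int_rd_1, sse_1 = 14195, 20267, 11743

# Four integer copies on their own.
int_wr_4 = gain_percent(55641, int_wr_1)
int_rd_4 = gain_percent(80746, int_rd_1)

# Four integer plus four SSE copies running together.
int_wr_8 = gain_percent(33641, int_wr_1)
int_rd_8 = gain_percent(43547, int_rd_1)
sse_8 = gain_percent(25011, sse_1)

print(int_wr_4, int_rd_4)                  # integer tests alone
print(int_wr_8 + sse_8, int_rd_8 + sse_8)  # combined integer + SSE gain
```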






Roy Longbottom October 2014



The Internet Home for my PC Benchmarks is via the link
Roy Longbottom's PC Benchmark Collection